[1] 2
[1] 11
[1] 0.4411887
[1] 2127969009576438321662422224286864628246682226688202424468224488848044260660204006846282880004802664460642226688200044466840040824846488266880020484242462244
Master 1 Economy & Psychology — Paris-Cité — 2025-2026
October 17, 2025
From Denise Earle on X
Tip
R is designed for scripts. Write your code in scripts, so that you can: (1) track everything you have done; (2) reproduce everything you have done; (3) share everything you have done.
expe1 and not understanding why the right columns are not showing? Been there, done that[1] 2
[1] 11
[1] 0.4411887
[1] 2127969009576438321662422224286864628246682226688202424468224488848044260660204006846282880004802664460642226688200044466840040824846488266880020484242462244
Tip
Anything that comes after a # in an R script is a comment that will never be run by R.
Comments are very useful for remembering why and how you wrote your script and for explaining it to yourself and others (think reproducibility).
adding_1 <- 1 + 1
subtract_1 <- 19 - 8
divide_1 <- log(x = 98, base = exp(1)) / sqrt(x = 108)
multiply_1 <- 78^83 * (1 / log(x = sqrt(x = 32^3), base = exp(1)))
adding_1 + subtract_1 / (divide_1 * multiply_1)[1] 2
Notes
The <- operator means that I am assigning what is on the right of the operator to the object that is on the left of the operator.
When defining and manipulating objects, R (1) stores them in the Environment window; and (2) prints the used code and the answer in the Console window.
Always use lowercase letters when naming your objects (R is case-sensitive) and avoid using diacritics or special characters. A good, albeit slightly long, name is lm_income_by_age_robustness_trimmed_1_99. This is a VERY BAD name: Predictor_âge_Sexe_robustness1_sansélèvesdéfaillants. Technically, it works. However, you’ll find yourself having to try very hard to write it correctly every time you need to.
x <- c(10, 4, 7, 12, 3, 18, 4, 12, 8) # Manually define an object x that takes multiple values
sqrt(x = x)[1] 3.162278 2.000000 2.645751 3.464102 1.732051 4.242641 2.000000 3.464102
[9] 2.828427
Note
In this case, the function only takes one argument, x. Usually, a function takes multiple arguments (x, y, dataframe, etc.).
x <- c(10, 4, 7, 12, 3, 18, 4, 12, 8) # Manually define an object x that takes multiple values
min_x <- function(x) { # Take the object x as an input
i <- which.min(x) # Find the smallest value of x
val <- x[i] + i # Add the line number to the smallest value
val * as.numeric(format(Sys.time(), "%H")) # Multiply the result by the hour of the day (0–23)
}
min_x(x = x)[1] 104
Note
In this case, the function only takes one argument, x. Usually, a function takes multiple arguments (x, y, dataframe, etc.).
base Rcontributed packages and are extremely usefulNote
Simply installing a package is not enough: you also need to load it each time you launch a new session.
dplyr packageBy Albert Y. Kim on X
haven packagenanoparquet packageNote
Errors may occur when loading data. CSV files can be temperamental, and there are multiple standards in use. The reason for this is that some standards use commas , as column separators, while some use semicolons ;. Moreover, some standards use periods . as decimal points and some use commas , as decimal points. Sometimes you need to dig in manually. This could be due to the column separator (does it use tabulation?), because the file contains row names that you did not specify (are you using a small x instead of a capital X?), etc… It can be a bit cumbersome — we’ve all been there!
Tip
You can make your life much easier by using File –> Import Dataset 😉
read.table() function?read.table() in the consoleheader = FALSE). Arguments with no option in the documentation indicate that a value is requiredfile = part is not necessary, as long as you respect the order of the arguments"" are necessary to indicate a string. Else, R will consider that Data/penguins.csv is an objectTRUE or T; FALSE or Fcolnames() returns the name of the columns (which are vectors)nrow() and ncol() respectively return the number of rows and of columnshead() returns the first rows of an object species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18.0 195 3250
4 Adelie Torgersen NA NA NA <NA>
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
sex year
1 male 2007
2 female 2007
3 female 2007
4 <NA> 2007
5 female 2007
6 male 2007
str() returns the structure of the object'data.frame': 344 obs. of 8 variables:
$ species : chr "Adelie" "Adelie" "Adelie" "Adelie" ...
$ island : chr "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
$ bill_length_mm : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : chr "3750" "3800" "3250" NA ...
$ sex : chr "male" "female" "female" NA ...
$ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
chr, num and int indicate the type of data (vectors) that the dataframe containsNote
Can you spot the error(s) ?
summary() returns a summary of the object species island bill_length_mm bill_depth_mm
Length:344 Length:344 Min. :32.10 Min. :13.10
Class :character Class :character 1st Qu.:39.23 1st Qu.:15.60
Mode :character Mode :character Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Length:344 Length:344 Min. :2007
1st Qu.:190.0 Class :character Class :character 1st Qu.:2007
Median :197.0 Mode :character Mode :character Median :2008
Mean :200.9 Mean :2008
3rd Qu.:213.0 3rd Qu.:2009
Max. :231.0 Max. :2009
NA's :2
describe(), well… describes the object vars n mean sd median trimmed mad min max
species* 1 344 1.92 0.89 2.00 1.90 1.48 1.0 3.0
island* 2 344 1.66 0.73 2.00 1.58 1.48 1.0 3.0
bill_length_mm 3 342 43.92 5.46 44.45 43.91 7.04 32.1 59.6
bill_depth_mm 4 342 17.15 1.97 17.30 17.17 2.22 13.1 21.5
flipper_length_mm 5 342 200.92 14.06 197.00 200.34 16.31 172.0 231.0
body_mass_g* 6 342 44.87 24.67 42.00 44.19 29.65 1.0 94.0
sex* 7 333 1.50 0.50 2.00 1.51 0.00 1.0 2.0
year 8 344 2008.03 0.82 2008.00 2008.04 1.48 2007.0 2009.0
range skew kurtosis se
species* 2.0 0.16 -1.73 0.05
island* 2.0 0.61 -0.91 0.04
bill_length_mm 27.5 0.05 -0.89 0.30
bill_depth_mm 8.4 -0.14 -0.92 0.11
flipper_length_mm 59.0 0.34 -1.00 0.76
body_mass_g* 93.0 0.24 -1.06 1.33
sex* 1.0 -0.02 -2.01 0.03
year 2.0 -0.05 -1.51 0.04
Note
describe() treats non-numerical data as numerical and indicates doing so with ✱. Be careful!
datasummary_skim() describes the object, with a visual aid for numerical data| Unique | Missing Pct. | Mean | SD | Min | Median | Max | Histogram | |
|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 165 | 1 | 43.9 | 5.5 | 32.1 | 44.5 | 59.6 | |
| bill_depth_mm | 81 | 1 | 17.2 | 2.0 | 13.1 | 17.3 | 21.5 | |
| flipper_length_mm | 56 | 1 | 200.9 | 14.1 | 172.0 | 197.0 | 231.0 | |
| year | 3 | 0 | 2008.0 | 0.8 | 2007.0 | 2008.0 | 2009.0 |
datasummary_skim() describes the objectCaution
Warning: These variables were omitted because they include more than 50 levels: body_mass_g.
Notes
If you cannot remember to which package a function belongs, for instance datasummary_skim(), you can type ??datasummary_skim() in the console. This will take a while, as R will search through every package you have ever installed for the function.
TRUE or FALSE data. Can also take values NA or not, representing Missing Valuesbody_mass_g column, treating it as a character rather than a numeric value
Notes
Use the $ sign to access vectors stored in objects, such as a dataframe. This method is not 100% reliable, depending on the structure of the object, but you will learn as you go along.
Most of the time, you don’t need to work in base R to manipulate data. You will learn to use the tidyverse packages (Wickham et al. 2019), which will make your data manipulation much easier.
tidyverse and the pipe operator %>%rename() renames columns and filter() filters columns according to conditions%>%%>% operator takes the object on its left side and makes it the first argument of the function on its right side. Said differently, R can now work linearly!|> also exists. You can find a summary of the main differences here. Personally, I use the %>% operator out of habit, and also because it is easier to use with an AZERTY keyboardA tidyverse workflow, taken from Wickham, Cetinkaya-Rundel, and Grolemund (2023).
NAs)id | group | before | after |
|---|---|---|---|
1 | control | 40.48391 | 42.24692 |
2 | control | 52.06518 | 54.77531 |
3 | control | 53.46311 | 40.54112 |
4 | control | 52.94401 | 53.91534 |
5 | control | 53.86674 | 51.82061 |
6 | test | 45.80264 | 52.06510 |
7 | test | 54.11571 | 57.03168 |
8 | test | 49.76629 | 66.47595 |
9 | test | 56.28729 | 56.44203 |
10 | test | 53.39197 | 64.62119 |
Note
A pivot_wider() function is also available, depending on the required pivoting.
id | group | time | value |
|---|---|---|---|
1 | control | before | 40.48391 |
1 | control | after | 42.24692 |
2 | control | before | 52.06518 |
2 | control | after | 54.77531 |
3 | control | before | 53.46311 |
3 | control | after | 40.54112 |
4 | control | before | 52.94401 |
4 | control | after | 53.91534 |
5 | control | before | 53.86674 |
5 | control | after | 51.82061 |
6 | test | before | 45.80264 |
6 | test | after | 52.06510 |
7 | test | before | 54.11571 |
7 | test | after | 57.03168 |
8 | test | before | 49.76629 |
8 | test | after | 66.47595 |
9 | test | before | 56.28729 |
9 | test | after | 56.44203 |
10 | test | before | 53.39197 |
10 | test | after | 64.62119 |
table() and prop.table()table() function. It won’t get you very far, but it is sometimes enough to quickly skim through your data (check how many categories a column has; whether some categories need to be merged together; the number of observations per category…) species
sex Adelie Chinstrap Gentoo
female 73 34 58
male 73 34 61
prop.table() function species
sex Adelie Chinstrap Gentoo
female 21.92192 10.21021 17.41742
male 21.92192 10.21021 18.31832
Careful!
We did not check for missing data thoroughly enough. Reporting these tables without further exploring whether (and how much) data is missing would be a bold move.
df %>%
summarize(
mean = mean(body_mass_g),
sd = sd(body_mass_g),
n = n(),
se = sd / sqrt(n),
confint_low = mean - 1.96 * se,
confint_high = mean + 1.96 * se,
.by = sex) sex mean sd n se confint_low confint_high
1 male 4545.685 787.6289 168 60.76689 4426.581 4664.788
2 female 3862.273 666.1720 165 51.86142 3760.624 3963.921
3 <NA> NA NA 11 NA NA NA
df %>%
drop_na() %>%
summarize(
mean = mean(body_mass_g),
sd = sd(body_mass_g),
n = n(),
se = sd / sqrt(n),
confint_low = mean - 1.96 * se,
confint_high = mean + 1.96 * se,
.by = sex) sex mean sd n se confint_low confint_high
1 male 4545.685 787.6289 168 60.76689 4426.581 4664.788
2 female 3862.273 666.1720 165 51.86142 3760.624 3963.921
drop_na() function to the entire dataset does not alter the results. However, most of the time it will. It would be better to do the following. Why?df %>%
select(body_mass_g, sex) %>%
drop_na() %>%
summarize(
mean = mean(body_mass_g),
sd = sd(body_mass_g),
n = n(),
se = sd / sqrt(n),
confint_low = mean - 1.96 * se,
confint_high = mean + 1.96 * se,
.by = sex) sex mean sd n se confint_low confint_high
1 male 4545.685 787.6289 168 60.76689 4426.581 4664.788
2 female 3862.273 666.1720 165 51.86142 3760.624 3963.921
flextable() package 1/2Sex | Mean body mass (g) | sd | Number of penguins | se | 95% confidence interval [lower] | 95% confidence interval [upper] |
|---|---|---|---|---|---|---|
Male | 4,545.68 | 787.63 | 168 | 60.77 | 4,426.58 | 4,664.79 |
Female | 3,862.27 | 666.17 | 165 | 51.86 | 3,760.62 | 3,963.92 |
flextable() package 2/2# Install the package and load it
install.packages("flextable")
library("flextable")
# Construct the table
df %>%
select(body_mass_g, sex) %>%
drop_na() %>%
summarize(
mean = mean(body_mass_g),
sd = sd(body_mass_g),
n = n(),
se = sd / sqrt(n),
confint_low = mean - 1.96 * se,
confint_high = mean + 1.96 * se,
.by = sex) %>%
rename("Sex" = "sex",
"Mean body mass (g)" = "mean",
"Number of penguins" = "n",
"95% confidence interval [lower]" = "confint_low",
"95% confidence interval [upper]" = "confint_high") %>%
mutate(across(where(is.numeric), ~ round(.x, 2)),
Sex = ifelse(test = Sex == "male", yes = "Male", no = "Female")) %>%
# Format the table
flextable() %>%
bold(part = "header") %>%
autofit() # %>%
# save_as_docx(path = "/Tables/Penguins_table_1.docx") # Or save_as_html()knitr() and kableExtra() packageslibrary("knitr")
library("kableExtra")
df %>%
select(body_mass_g, sex) %>%
drop_na() %>%
summarize(
mean = mean(body_mass_g),
sd = sd(body_mass_g),
n = n(),
se = sd / sqrt(n),
confint_low = mean - 1.96 * se,
confint_high = mean + 1.96 * se,
.by = sex) %>%
rename("Sex" = "sex",
"Mean body mass (g)" = "mean",
"Number of penguins" = "n",
"95% confidence interval [lower]" = "confint_low",
"95% confidence interval [upper]" = "confint_high") %>%
mutate(across(where(is.numeric), ~ round(.x, 2)),
Sex = ifelse(Sex == "male", "Male", "Female")) %>%
kable(format = "latex", booktabs = TRUE, align = "c") %>%
kable_styling(latex_options = c("hold_position", "striped")) %>%
save_kable("Tables/Penguins_table_1.tex")gtsummary packageCareful!
We were right to be careful: 11 observations had missing data for penguins’ sex!
df %>%
select(body_mass_g, sex, species, bill_length_mm, bill_depth_mm, year) %>%
mutate(body_mass_pounds = body_mass_g / 453.59237,
sex = case_when(sex == "female" ~ "Femelle",
sex == "male" ~ "Mâle")) %>%
rename("Espèce" = "species",
"企鵝喙的長度(單位:毫米)" = "bill_length_mm",
"عمق المنقار (مم) )" = "bill_depth_mm") %>%
filter(body_mass_pounds >= (mean(body_mass_pounds, na.rm = TRUE) + sd(body_mass_pounds, na.rm = TRUE))) %>%
drop_na() %>%
tbl_summary(by = "sex",
statistic = list(all_continuous() ~ "{mean} ({sd})"))| Characteristic | Femelle N = 51 |
Mâle N = 561 |
|---|---|---|
| Espèce | ||
| Gentoo | 5 (100%) | 56 (100%) |
| island | ||
| Biscoe | 5 (100%) | 56 (100%) |
| 企鵝喙的長度(單位:毫米) | 46.16 (1.76) | 49.59 (2.76) |
عمق المنقار (مم) |
14.44 (0.65) | 15.76 (0.75) |
| flipper_length_mm | 214 (5) | 222 (6) |
| body_mass_g | 5,140 (65) | 5,534 (276) |
| year | ||
| 2007 | 1 (20%) | 17 (30%) |
| 2008 | 3 (60%) | 19 (34%) |
| 2009 | 1 (20%) | 20 (36%) |
| body_mass_pounds | 11.33 (0.14) | 12.20 (0.61) |
| 1 n (%); Mean (SD) | ||
\[ \begin{align} \text{Model 1: } Y_i &= \alpha + \varepsilon_i \\ \text{Model 2: } Y_i &= \alpha + \beta \, X_{i,\text{flipper length}} + \varepsilon_i \\ \text{Model 3: } Y_i &= \alpha + \beta \, X_{i,\text{flipper length}} + \lambda \, X_{i,\text{bill depth}} + \varepsilon_i \\ \text{Model 4: } Y_i &= \alpha + \beta \, X_{i,\text{flipper length}} + \lambda \, X_{i,\text{bill depth}} + \gamma \, (X_{i,\text{flipper length}} \times X_{i,\text{bill depth}}) + \varepsilon_i \end{align} \]
summary() function
Call:
lm(formula = body_mass_g ~ 1, data = df)
Residuals:
Min 1Q Median 3Q Max
-1501.8 -651.8 -151.8 548.2 2098.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4201.75 43.36 96.89 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 802 on 341 degrees of freedom
(2 observations effacées parce que manquantes)
summary() function
Call:
lm(formula = body_mass_g ~ 1 + flipper_length_mm, data = df)
Residuals:
Min 1Q Median 3Q Max
-1058.80 -259.27 -26.88 247.33 1288.69
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5780.831 305.815 -18.90 <0.0000000000000002 ***
flipper_length_mm 49.686 1.518 32.72 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 394.3 on 340 degrees of freedom
(2 observations effacées parce que manquantes)
Multiple R-squared: 0.759, Adjusted R-squared: 0.7583
F-statistic: 1071 on 1 and 340 DF, p-value: < 0.00000000000000022
summary() function
Call:
lm(formula = body_mass_g ~ 1 + flipper_length_mm + bill_depth_mm,
data = df)
Residuals:
Min 1Q Median 3Q Max
-1029.78 -271.45 -23.58 245.15 1275.97
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6541.907 540.751 -12.098 <0.0000000000000002 ***
flipper_length_mm 51.541 1.865 27.635 <0.0000000000000002 ***
bill_depth_mm 22.634 13.280 1.704 0.0892 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 393.2 on 339 degrees of freedom
(2 observations effacées parce que manquantes)
Multiple R-squared: 0.761, Adjusted R-squared: 0.7596
F-statistic: 539.8 on 2 and 339 DF, p-value: < 0.00000000000000022
summary() function
Call:
lm(formula = body_mass_g ~ 1 + flipper_length_mm + bill_depth_mm +
flipper_length_mm:bill_depth_mm, data = df)
Residuals:
Min 1Q Median 3Q Max
-938.88 -253.96 -28.25 220.66 1048.33
Coefficients:
Estimate Std. Error t value
(Intercept) -36097.064 4636.271 -7.786
flipper_length_mm 196.074 22.603 8.675
bill_depth_mm 1771.796 273.003 6.490
flipper_length_mm:bill_depth_mm -8.596 1.340 -6.414
Pr(>|t|)
(Intercept) 0.0000000000000855 ***
flipper_length_mm < 0.0000000000000002 ***
bill_depth_mm 0.0000000003057955 ***
flipper_length_mm:bill_depth_mm 0.0000000004782702 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 371.8 on 338 degrees of freedom
(2 observations effacées parce que manquantes)
Multiple R-squared: 0.787, Adjusted R-squared: 0.7851
F-statistic: 416.2 on 3 and 338 DF, p-value: < 0.00000000000000022
modelsummary packagelibrary("modelsummary")
list_lm <- list(
"Model 1" = lm1,
"Model 2" = lm2,
"Model 3" = lm3,
"Model 4" = lm4)
modelsummary(list_lm,
stars = T,
fmt = 2,
vcov = "HC3",
title = "Table 1: An example of a couple of different linear models.",
notes = "Notes. Y = Body mass (in grams). Robust standard errors 'HC3'. Noice, right!?",
coef_rename = c(
"flipper_length_mm" = "Flipper length (mm)",
"bill_depth_mm" = "Bill depth (mm)",
"flipper_length_mm:bill_depth_mm" = "Flipper length (mm) X Bill depth (mm)"),
gof_map = c("nobs", "adj.r.squared"))modelsummary package| Model 1 | Model 2 | Model 3 | Model 4 | |
|---|---|---|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||||
| Notes. Y = Body mass (in grams). Robust standard errors 'HC3'. Noice, right!? | ||||
| (Intercept) | 4201.75*** | -5780.83*** | -6541.91*** | -36097.06*** |
| (43.43) | (294.88) | (500.35) | (4389.62) | |
| Flipper length (mm) | 49.69*** | 51.54*** | 196.07*** | |
| (1.45) | (1.75) | (21.46) | ||
| Bill depth (mm) | 22.63+ | 1771.80*** | ||
| (12.18) | (259.54) | |||
| Flipper length (mm) X Bill depth (mm) | -8.60*** | |||
| (1.28) | ||||
| Num.Obs. | 342 | 342 | 342 | 342 |
| R2 Adj. | 0.000 | 0.758 | 0.760 | 0.785 |
From Peter Peregrine’s blog
afex package (see its Github)glm() function (see here)lme4 package. See Bates et al. (2015) and Kuznetsova, Brockhoff, and Christensen (2017)fixest (Bergé 2018) or plm (Croissant and Millo 2008) packagesmarginaleffects package (Arel-Bundock, Greifer, and Heiss 2024). See also the accompanying free e-book Model to Meaning, as well as Rohrer & Arel-Bundock (2025)ggplot2 package (Wickham 2016, with the free e-books ggplot2: Elegant Graphics for Data Analysis and R Graphics Cookbook)sandwich package (Zeileis, Köll, and Graham 2020), which is incorporated in modelsummary (see its vcov argument here)pwr package (see its vignette)lavaan package for EFAs, CFAs, SEMs… (Rosseel 2012, with its website)survey package (Lumley 2004, with online vignettes and a nice free e-book)performance package (Lüdecke et al. 2021 and its vignettes)tidyverse) have a really nice webpage. They sometimes come with a Cheat Sheet (here for dplyr and here for tidyr). They might even have an accompanying book (Wickham, Cetinkaya-Rundel, and Grolemund 2023) which can be freely accessed onlinedplyr package 1/2dplyr package 2/2